official statistics
Which Imputation Fits Which Feature Selection Method? A Survey-Based Simulation Study
Schwerter, Jakob, Romero, Andrés, Dumpert, Florian, Pauly, Markus
Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of features on the outcome variables in the model. This also applies to survey data, which are frequently encountered in the social sciences and official statistics. These types of datasets often present the challenge of missing values. The typical solution is to impute the missing data before applying the learning method. However, given the large number of possible imputation methods available, the question arises as to which should be chosen to achieve the 'best' reflection of feature importance and feature selection in subsequent analyses. In the present paper, we investigate this question in a survey-based simulation study for eight state-of-the-art imputation methods and three learners. The imputation methods comprise listwise deletion, three MICE options, four missRanger options, as well as the recently proposed mixGBoost imputation approach. As learners, we consider the two most common tree-based methods, Random Forest and XGBoost, and an interpretable linear model with regularization.
- Europe > Austria > Vienna (0.14)
- Europe > Germany > North Rhine-Westphalia > Arnsberg Region > Dortmund (0.04)
- Europe > Germany > Hesse > Darmstadt Region > Wiesbaden (0.04)
- (12 more...)
- Government (1.00)
- Education > Educational Setting (0.93)
- Health & Medicine (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.51)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
Leveraging Machine Learning for Official Statistics: A Statistical Manifesto
Puts, Marco, Salgado, David, Daas, Piet
Machine learning presents both opportunities and challenges for official statistics production, and it is important that it be applied with statistical rigor. Although machine learning has enjoyed rapid technological advances in recent years, its application often lacks the methodological robustness necessary to produce high-quality statistical results. To account for all sources of error in machine learning models, the Total Machine Learning Error (TMLE) is presented as a framework analogous to the Total Survey Error Model used in survey methodology. To ensure that ML models are both internally and externally valid, the TMLE model addresses issues such as representativeness and measurement errors. Several case studies are presented, illustrating the importance of applying more rigor to the application of machine learning in official statistics.
- North America > United States > New York (0.04)
- North America > United States > Kansas > Butler County (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- (2 more...)
A step towards the integration of machine learning and small area estimation
The use of machine-learning techniques has grown in numerous research areas. Currently, it is also widely used in statistics, including official statistics, for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) as well as for data analysis. However, the usage of these methods in survey sampling, including small area estimation, is still very limited. Therefore, we propose a predictor supported by these algorithms which can be used to predict any population or subpopulation characteristics based on cross-sectional and longitudinal data. Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between variables, which means that they have very good properties in the case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up which, in our opinion, is of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with methods that are optimal under the model. Moreover, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods, where accuracy is measured as in survey sampling practice. Solving this problem is indicated in the literature as one of the key issues in the integration of these approaches. The simulation studies are based on a real, longitudinal dataset, freely available from the Polish Local Data Bank, in which the prediction problem of subpopulation characteristics in the last period, with "borrowing strength" from other subpopulations and time periods, is considered.
- Europe > Poland > Silesia Province > Katowice (0.04)
- North America > United States > New York (0.04)
- Europe > Germany (0.04)
- (9 more...)
- Energy (0.88)
- Banking & Finance (0.67)
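The "borrowing strength" idea behind small area estimation can be shown in a minimal Monte Carlo sketch. This is not the authors' predictor or their Polish Local Data Bank study: the data-generating process, the composite weight of 0.5, and all parameters below are assumptions chosen so that per-area samples are small and noisy, which is when pooling across areas pays off.

```python
import numpy as np

rng = np.random.default_rng(1)
n_areas, n_per_area = 20, 15          # many areas, few observations each

def simulate(rng):
    # Area-level random effects are small relative to unit-level noise.
    area_effect = rng.normal(scale=0.3, size=n_areas)
    x = rng.normal(size=(n_areas, n_per_area))
    eps = rng.normal(scale=2.0, size=x.shape)
    y = 1.5 * x + area_effect[:, None] + eps
    theta = 1.5 * x.mean(axis=1) + area_effect   # target area means

    # Direct estimator: per-area sample mean (unbiased but high variance).
    direct = y.mean(axis=1)

    # Pooled predictor: one regression fit on ALL areas ("borrowing strength"),
    # then a simple composite of the direct estimate and the pooled prediction.
    slope = np.cov(x.ravel(), y.ravel())[0, 1] / np.var(x.ravel())
    intercept = y.mean() - slope * x.mean()
    pooled_pred = (intercept + slope * x).mean(axis=1)
    composite = 0.5 * direct + 0.5 * pooled_pred

    return (np.mean((direct - theta) ** 2),
            np.mean((composite - theta) ** 2))

# Accuracy measured as in survey sampling practice: MSE against the true
# area characteristic, averaged over Monte Carlo replications.
reps = np.array([simulate(rng) for _ in range(300)])
mse_direct, mse_composite = reps.mean(axis=0)
print(f"MSE direct {mse_direct:.3f} vs composite {mse_composite:.3f}")
```

The simulation reports MSE the way the abstract describes comparing ML predictors with classic ones; a production method would replace the pooled regression with a stronger learner and estimate the MSE from the data rather than from known truth.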
The Applicability of Federated Learning to Official Statistics
Stock, Joshua, Hauke, Oliver, Weißmann, Julius, Federrath, Hannes
This work investigates the potential of Federated Learning (FL) for official statistics and shows how well the performance of FL models can keep up with centralized learning methods. FL is particularly interesting for official statistics because its utilization can safeguard the privacy of data holders, thus facilitating access to a broader range of data. By simulating three different use cases, important insights into the applicability of the technology are gained. The use cases are based on a medical insurance data set, a fine dust pollution data set and a mobile radio coverage data set, all of which are from domains close to official statistics. We provide a detailed analysis of the results, including a comparison of centralized and FL algorithm performances for each simulation. In all three use cases, we were able to train models via FL that reach a performance very close to the centralized model benchmarks. Our key observations and their implications for transferring the simulations into practice are summarized. We conclude that FL has the potential to emerge as a pivotal technology in future use cases of official statistics.
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Banking & Finance (0.87)
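The centralized-versus-federated comparison the abstract reports can be sketched with a minimal FedAvg loop. This is not the paper's setup (no medical, pollution, or radio data): it is a toy linear-regression task with IID client shards and invented hyperparameters, showing how weight averaging across data holders can match pooled training when the shards are homogeneous.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five "data holders", each with a private IID shard of the same task.
K, n_per, d = 5, 200, 3
w_true = np.array([1.0, -2.0, 0.5])
Xs = [rng.normal(size=(n_per, d)) for _ in range(K)]
ys = [X @ w_true + rng.normal(scale=0.1, size=n_per) for X in Xs]

def local_steps(w, X, y, lr=0.1, steps=10):
    # Plain gradient descent on the local squared-error loss.
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Federated averaging: clients train locally, server averages the weights.
w_fl = np.zeros(d)
for _ in range(20):                   # communication rounds
    w_fl = np.mean([local_steps(w_fl, X, y) for X, y in zip(Xs, ys)], axis=0)

# Centralized baseline: same optimizer on the pooled data.
X_all, y_all = np.vstack(Xs), np.concatenate(ys)
w_central = local_steps(np.zeros(d), X_all, y_all, steps=200)

mse_fl = np.mean((X_all @ w_fl - y_all) ** 2)
mse_central = np.mean((X_all @ w_central - y_all) ** 2)
print(f"MSE federated {mse_fl:.4f} vs centralized {mse_central:.4f}")
```

With non-IID shards, the gap between the two curves widens, which is exactly the kind of effect the paper's use-case simulations probe.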
Changing Data Sources in the Age of Machine Learning for Official Statistics
De Boom, Cedric, Reusens, Michael
Data science has become increasingly essential for the production of official statistics, as it enables the automated collection, processing, and analysis of large amounts of data. With such data science practices in place, reporting becomes more timely, more insightful and more flexible. However, the quality and integrity of data-science-driven statistics rely on the accuracy and reliability of the data sources and the machine learning techniques that support them. In particular, changes in data sources inevitably occur and pose significant risks that are crucial to address in the context of machine learning for official statistics. This paper gives an overview of the main risks, liabilities, and uncertainties associated with changing data sources in the context of machine learning for official statistics. We provide a checklist of the most prevalent origins and causes of changing data sources, not only on a technical level but also regarding ownership, ethics, regulation, and public perception. Next, we highlight the repercussions of changing data sources on statistical reporting. These include technical effects such as concept drift, bias, availability, validity, accuracy and completeness, but also the neutrality and potential discontinuation of the statistical offering. We offer a few important precautionary measures, such as enhancing robustness in both data sourcing and statistical techniques, and thorough monitoring. In doing so, machine learning-based official statistics can maintain integrity, reliability, consistency, and relevance in policy-making, decision-making, and public discourse.
- Research Report (1.00)
- Overview (1.00)
- Information Technology > Security & Privacy (1.00)
- Law (0.94)
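The monitoring the abstract recommends often starts with distribution checks on incoming data. A minimal sketch, not taken from the paper: a hand-rolled two-sample Kolmogorov-Smirnov statistic flags when a new batch of a feature no longer matches the reference distribution the model was trained on; the threshold below is an assumed alert level that would be tuned per application.

```python
import numpy as np

def ks_statistic(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    # the two empirical CDFs, evaluated at every observed value.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(7)
reference = rng.normal(0.0, 1.0, size=5000)   # feature at training time
stable = rng.normal(0.0, 1.0, size=5000)      # new batch, same source
shifted = rng.normal(0.8, 1.0, size=5000)     # new batch after a source change

THRESHOLD = 0.05  # assumed alert level; calibrate on historical batches
drift_stable = ks_statistic(reference, stable) > THRESHOLD
drift_shifted = ks_statistic(reference, shifted) > THRESHOLD
print(f"stable batch drift: {drift_stable}, shifted batch drift: {drift_shifted}")
```

A per-feature check like this catches distribution shift but not concept drift in the feature-target relation, which needs labeled or audited samples; in practice both belong in the monitoring the paper calls for.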
How international collaboration is advancing machine learning in official statistics
New technologies and data sources have tremendous potential to improve statistical production. They offer a way to generate statistics in a more timely, accurate and cost-efficient manner. Yet, keeping up with the pace of change is challenging, especially for National Statistical Organisations (NSOs) that must innovate with care to maintain a "gold standard" in their outputs. International cooperation between NSOs and other official statistical bodies is one way to help accelerate change in a responsible way. In 2021, the Office for National Statistics (ONS) and the United Nations Economic Commission for Europe (UNECE) Machine Learning Group (ML 2021) demonstrated the benefits of international cooperation for technological advance.
- Europe (0.58)
- North America > Mexico (0.05)
Join the ONS-UNECE Machine Learning Group 2021 Webinar 19 November 2021
Machine Learning (ML) holds great potential for statistical organisations. It can make the production of statistics more efficient by automating specific processes or assisting humans in carrying out the process. It also allows statistical organisations to use new types of data such as social media data and imagery. Many national statistical offices (NSOs) are investigating how ML can be used to increase the relevance and quality of official statistics in an environment of growing demands for trusted information, rapidly developing and accessible technologies, and numerous competitors. Machine learning is revolutionising the way statistical organisations produce statistics.
- North America > United States (0.20)
- Asia > Indonesia (0.10)
- North America > Mexico (0.07)
Dynamic Question Ordering in Online Surveys
Early, Kirstin, Mankoff, Jennifer, Fienberg, Stephen E.
Online surveys have the potential to support adaptive questions, where later questions depend on earlier responses. Past work has taken a rule-based approach, applied uniformly across all respondents. We envision a richer interpretation of adaptive questions, which we call dynamic question ordering (DQO), where question order is personalized. Such an approach could increase engagement, and therefore response rate, as well as imputation quality. We present a DQO framework to improve survey completion and imputation. In the general survey-taking setting, we want to maximize survey completion, and so we focus on ordering questions to engage the respondent and ideally collect all information, or at least the information that most characterizes the respondent, for accurate imputations. In another scenario, our goal is to provide a personalized prediction. Since it is possible to give reasonable predictions with only a subset of questions, we are not concerned with motivating users to answer all questions. Instead, we want to order questions to obtain information that reduces prediction uncertainty, while not being too burdensome. We illustrate this framework with an example of providing energy estimates to prospective tenants. We also discuss DQO for national surveys and consider connections between our statistics-based question-ordering approach and cognitive survey methodology.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.14)
- North America > United States > New York (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (3 more...)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Overview (1.00)
- Research Report > Strength High (0.93)
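The "order questions to reduce prediction uncertainty" idea admits a compact sketch under a strong simplifying assumption, not the authors' actual system: if questions and target are jointly Gaussian, the predictive variance of the target given any answered subset is available in closed form, and a greedy policy can ask next whichever question shrinks it most.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical joint covariance over four questions (indices 0..3) and the
# prediction target (index 4); in a real survey this would be estimated
# from historical respondent data.
d = 4
A = rng.normal(size=(d + 1, d + 1))
cov = A @ A.T + np.eye(d + 1)

def target_var_given(asked):
    # Gaussian conditional variance of the target given answered questions.
    if not asked:
        return cov[d, d]
    S = list(asked)
    C_ss = cov[np.ix_(S, S)]
    C_ts = cov[d, S]
    return cov[d, d] - C_ts @ np.linalg.solve(C_ss, C_ts)

# Greedy DQO: at each step ask the question whose answer would leave the
# smallest remaining variance on the target.
asked = []
while len(asked) < d:
    best = min((j for j in range(d) if j not in asked),
               key=lambda j: target_var_given(asked + [j]))
    asked.append(best)

variances = [target_var_given(asked[:k]) for k in range(d + 1)]
print("question order:", asked)
print("target variance after each answer:", np.round(variances, 3))
```

In the Gaussian case the next question does not depend on the answer values, only on which questions were answered; the personalization the paper describes comes from models where observed responses change the remaining uncertainty, with a burden term traded off against the expected variance reduction.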